Group 27 2021/11/25 (updated: 2021-12-11)
We use Ridge algorithm here to build a regression model to predict the popularity of spotify tracks based on features like danceability, loudness, tempo etc. The popularity score ranges from 0 to 100. A popularity score of 0 means the the song has minute popularity and a popularity score of 100 means the song is extremely popular.
Some songs sit atop popularity charts like the Billboard charts while certain other songs comfortably sit at the bottom of the charts. Some songs don’t even chart at all. This pose an interesting question to us on what exactly makes a song popular and we ask if we can be able to predict how popular a song will get. based on certain features. Some songs are unexceptionally popular while some other songs are not as popular. This is an attempt to answer this interesting question. We attempt here to make a prediction on the popularity of a song based on certain features.
According to this report, approximately 137 million new songs are released every year, and only about 14 records have sold 15 million physical copies or more in global history. Therefore, it is important to determine what exactly determines a track popularity and specifically make predictions on how popular a song will get based on based on features like danceability, loudness, tempo etc.
The dataset used in this project was sourced from Tidy Tuesday’s github repo here, and particularly here. The data, however, originally comes from Data.World, Billboard.com and Spotify. Each row from the dataset represents a song’s features and a target column specifying the song’s popularity on a scale of 0 (least popularity) to 100 (most popularity).
Ridge model was built to answer our research question (to predict spotify tracks popularity). This is a regression solution and predictions range from 0 (least popularity) to 100 (most popularity). All features in the original dataset were used to fit the model with the exceptions of ‘song_id,’ ‘spotify_track_id,’ ‘spotify_track_album’ features. A 10-fold cross-validation was used for hyperparameter optimization. The code used to perform this analysis can be found here.
It is usually very important to look at how the features are co-related and to see what their pairwise distributions look like. Here, the blue plots (and a fitting line) represents the paired distributions of the features, and the other boxes are the paired correlations of the features. As can be seen, the correlations are fair and not unreasonable, hence the features can be used together for building the Ridge model that seeks to answer the predictive question.